Netflix Data Analysis Project - Advanced

Introduction

This is a project made in the context of the Tech Academy R for Data Science Advanced course at the Goethe University Frankfurt in 2022/2023.

The general Netflix data comes from a Kaggle dataset; I’m working with Version 5, last updated on September 27th, 2021.

In addition, I also use my own data, which anyone can receive from Netflix upon request.

Prep

Import libraries

library(tidyverse)
library(lubridate)
library(plotly)
library(ggExtra)
library(wordcloud)
library(tm)
library(SnowballC)
library(lsa)

Set directory

setwd("D:/R_TechAcademy")

Load dataset

I then load the Kaggle dataset as netflix_general:

netflix_general <- read_csv("netflix_titles.csv")

Getting started

Discovering the Data

Let’s take a quick look at the data:

head(netflix_general)
## # A tibble: 6 × 12
##   show_id type    title     direc…¹ cast  country date_…² relea…³ rating durat…⁴
##   <chr>   <chr>   <chr>     <chr>   <chr> <chr>   <chr>     <dbl> <chr>  <chr>  
## 1 s1      Movie   Dick Joh… Kirste… <NA>  United… Septem…    2020 PG-13  90 min 
## 2 s2      TV Show Blood & … <NA>    Ama … South … Septem…    2021 TV-MA  2 Seas…
## 3 s3      TV Show Ganglands Julien… Sami… <NA>    Septem…    2021 TV-MA  1 Seas…
## 4 s4      TV Show Jailbird… <NA>    <NA>  <NA>    Septem…    2021 TV-MA  1 Seas…
## 5 s5      TV Show Kota Fac… <NA>    Mayu… India   Septem…    2021 TV-MA  2 Seas…
## 6 s6      TV Show Midnight… Mike F… Kate… <NA>    Septem…    2021 TV-MA  1 Seas…
## # … with 2 more variables: listed_in <chr>, description <chr>, and abbreviated
## #   variable names ¹​director, ²​date_added, ³​release_year, ⁴​duration

Data cleaning

The data is mostly pretty clean, but there is one error I only noticed later on; we have to deal with it right away or it’ll mess up our results. Three cells in the rating column actually belong in duration:

mins_in_rating <- netflix_general$rating %in% c("66 min", "74 min", "84 min")
netflix_general$duration[mins_in_rating] <- netflix_general$rating[mins_in_rating]
netflix_general$rating[mins_in_rating] <- NA

With that out of the way, let’s deal with data types. First, we need to get an overview of the data:

glimpse(netflix_general)
## Rows: 8,807
## Columns: 12
## $ show_id      <chr> "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9", "s1…
## $ type         <chr> "Movie", "TV Show", "TV Show", "TV Show", "TV Show", "TV …
## $ title        <chr> "Dick Johnson Is Dead", "Blood & Water", "Ganglands", "Ja…
## $ director     <chr> "Kirsten Johnson", NA, "Julien Leclercq", NA, NA, "Mike F…
## $ cast         <chr> NA, "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Mola…
## $ country      <chr> "United States", "South Africa", NA, NA, "India", NA, NA,…
## $ date_added   <chr> "September 25, 2021", "September 24, 2021", "September 24…
## $ release_year <dbl> 2020, 2021, 2021, 2021, 2021, 2021, 2021, 1993, 2021, 202…
## $ rating       <chr> "PG-13", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "TV-MA", "PG…
## $ duration     <chr> "90 min", "2 Seasons", "1 Season", "1 Season", "2 Seasons…
## $ listed_in    <chr> "Documentaries", "International TV Shows, TV Dramas, TV M…
## $ description  <chr> "As her father nears the end of his life, filmmaker Kirst…
summary(netflix_general)
##    show_id              type              title             director        
##  Length:8807        Length:8807        Length:8807        Length:8807       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      cast             country           date_added         release_year 
##  Length:8807        Length:8807        Length:8807        Min.   :1925  
##  Class :character   Class :character   Class :character   1st Qu.:2013  
##  Mode  :character   Mode  :character   Mode  :character   Median :2017  
##                                                           Mean   :2014  
##                                                           3rd Qu.:2019  
##                                                           Max.   :2021  
##     rating            duration          listed_in         description       
##  Length:8807        Length:8807        Length:8807        Length:8807       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

Most columns are stored as the character class, though that’s not what we want for “date_added” and “duration”.

Since the “date_added” column is written in the format “September 22, 2001”, that is, “Month Day, Year”, we can use the mdy function from the lubridate package to transform it into a date format we can work with.
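As a quick standalone illustration (assuming lubridate is loaded):

```r
library(lubridate)

# mdy() parses "Month Day, Year" strings into Date objects
mdy("September 22, 2001")
## [1] "2001-09-22"
```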

Making date_added an actual date

# Convert date_added to date format
netflix_general$date_added <- mdy(netflix_general$date_added)
# Check to see if that worked
class(netflix_general$date_added)
## [1] "Date"

Splitting duration of Movies and TV Shows

We should also make “duration” numeric. This is a bit more complicated, because the original dataset mixes duration in minutes for films and duration in seasons for TV Shows. We should separate those two into different columns.

The first line in the mutate() call creates a new column called duration_movie. For rows whose type is “Movie”, gsub() deletes the “ min” suffix and the result is converted to numeric; all other rows get NA in duration_movie.

The second line creates a new column called duration_season_number in the same way: for rows of type “TV Show”, gsub() deletes the “ Season”/“ Seasons” suffix and the result is converted to numeric; all other rows get NA in duration_season_number.

netflix_general <- netflix_general %>%
  mutate(duration_movie = as.numeric(gsub(" min", "", ifelse(type == "Movie", duration, NA))),
         duration_season_number = as.numeric(gsub(" Seasons?", "", ifelse(type == "TV Show", duration, NA))))
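A quick sanity check of these patterns on sample strings (base R only; the season pattern here is the tidied " Seasons?" regex):

```r
# Movies: strip the " min" suffix, then convert to numeric
as.numeric(gsub(" min", "", "90 min"))
## [1] 90
# TV shows: strip the " Season"/" Seasons" suffix, then convert to numeric
as.numeric(gsub(" Seasons?", "", c("1 Season", "2 Seasons")))
## [1] 1 2
```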

Let’s remove the duration column to avoid confusion later on.

netflix_general <- select(netflix_general, -duration)

Give some overall statements

Overall numbers: movies and TV

How many movies and TV shows are included? Let’s make a bar chart:

ggplot(netflix_general, aes(x = type)) + geom_bar() + 
  labs(x = NULL, y = NULL, title = "Number of Media on Netflix by Type")

Most represented country

Which country has the most content (movies and TV shows) featured on Netflix?

country_most <- netflix_general %>% 
  count(country) %>%
  slice_max(n)
message("The country with the most media on Netflix according to the dataset is the ", country_most[1, 1], ", with ", country_most[1, 2], " items.")
## The country with the most media on Netflix according to the dataset is the United States, with 2818 items.

Longest movie

What’s the longest movie (not TV show) included in the dataset?

longest_movie <- netflix_general %>% slice_max(duration_movie)
message("The longest movie in the dataset is ", longest_movie[1, 3], ", at ", longest_movie[1, 12], " minutes.")
## The longest movie in the dataset is Black Mirror: Bandersnatch, at 312 minutes.

This is actually an interesting case, since this film is a “choose your own adventure” special episode of the anthology series Black Mirror. Since it is interactive, there is no set amount of time the film takes, though Netflix themselves state that the run time for the default choices is only 90 minutes. This shows how individual cases can defy simple characterizations like “duration”. The 312 minutes we see here are likely the cumulative duration of all available scenes; a single screening of the film could never be that long.

A bar chart of the top 10 longest movies duration

First, let’s take a look at the 10 longest films on Netflix:

top_ten_length <- netflix_general %>%
  slice_max(duration_movie, n = 10)
print(top_ten_length)
## # A tibble: 10 × 13
##    show_id type  title   direc…¹ cast  country date_added relea…² rating liste…³
##    <chr>   <chr> <chr>   <chr>   <chr> <chr>   <date>       <dbl> <chr>  <chr>  
##  1 s4254   Movie Black … <NA>    Fion… United… 2018-12-28    2018 TV-MA  Dramas…
##  2 s718    Movie Headsp… <NA>    Andy… <NA>    2021-06-15    2021 TV-G   Docume…
##  3 s2492   Movie The Sc… Houssa… Suha… Egypt   2020-05-21    1973 TV-14  Comedi…
##  4 s2488   Movie No Lon… Samir … Said… Egypt   2020-05-21    1979 TV-14  Comedi…
##  5 s2485   Movie Lock Y… Fouad … Foua… <NA>    2020-05-21    1982 TV-PG  Comedi…
##  6 s2489   Movie Raya a… Hussei… Suha… <NA>    2020-05-21    1984 TV-14  Comedi…
##  7 s167    Movie Once U… Sergio… Robe… Italy,… 2021-09-01    1984 R      Classi…
##  8 s7933   Movie Sangam  Raj Ka… Raj … India   2019-12-31    1964 TV-14  Classi…
##  9 s1020   Movie Lagaan  Ashuto… Aami… India,… 2021-04-17    2001 PG     Dramas…
## 10 s4574   Movie Jodhaa… Ashuto… Hrit… India   2018-10-01    2008 TV-14  Action…
## # … with 3 more variables: description <chr>, duration_movie <dbl>,
## #   duration_season_number <dbl>, and abbreviated variable names ¹​director,
## #   ²​release_year, ³​listed_in

Now, let’s turn this into a bar chart (while the project guide calls this a histogram, it’s actually a bar chart since the data is discrete - each film is a separate entity with its own length - hence the gaps between each bar, which do not exist in a histogram).

ggplot(top_ten_length, aes(x = duration_movie, y = reorder(title, duration_movie))) + 
  geom_col() +
  labs(x = "Duration (min)", y = NULL, title = "The 10 Longest Movies on Netflix")

Movie duration mean and standard deviation

What is the mean and standard deviation of movie duration in minutes?

# mean
print(duration_mean <- mean(netflix_general$duration_movie, na.rm=T))
## [1] 99.565
# standard deviation
print(duration_sd <- sd(netflix_general$duration_movie, na.rm=T))
## [1] 28.2895

Visualizing average movie durations over time

The project guide suggests at this point that we analyze how the average movie length evolved with a graph.

Line graph of average durations

ggplot(netflix_general, aes(x = release_year, y = duration_movie)) + 
  geom_line(stat = "summary", fun = "mean")

We were asked to comment and interpret the graph - “were there any significant increases/decreases in movie length over time? If so, what could be the reason?”

I actually don’t believe the above graph of averages can tell us much about this, because the means hide the number of films they are based on per year. So I made a scatter plot with the film’s title as a hover text to better inspect the data. (You can count this as my “surprise us” plot)

Interactive scatter plot of movie durations

scatter_length <- ggplot(netflix_general, aes(x = release_year, y = duration_movie, text=title)) +  
  geom_point() +
  labs(x = "Release Year", y = "Duration (min)", title = "Netflix's movie durations over time")
ggplotly(scatter_length)

After actually checking the films in question, I believe this is just a quirk of the data and cannot tell us anything about actual trends in film length across the decades. The Netflix dataset has very few movies up to the 1990s, so the length of individual films in this period skews the results. A cluster of World War 2 documentaries running around 40 minutes brings the average down in the 1940s, while a few epics like Doctor Zhivago push it up in the 1960s; neither is necessarily representative of the average duration of films made in those periods. There are also quite a few “movies” in the last few decades that aren’t actually feature films, like Power Rangers specials, comedy specials, and special features (e.g. Creating the Queen’s Gambit), bringing the average down.

Your personal data

In this portion, each participant used a data set based on their own viewing activity, which they requested from Netflix.

Load your data

my_data <- read_csv("D:/R_TechAcademy/NetflixReport/CONTENT_INTERACTION/ViewingActivity.csv")

Let’s see what it looks like:

glimpse(my_data)
## Rows: 17,161
## Columns: 10
## $ `Profile Name`            <chr> "Isa", "Isa", "Isa", "Isa", "Isa", "Isa", "I…
## $ `Start Time`              <dttm> 2022-06-27 19:02:15, 2022-06-27 18:56:05, 2…
## $ Duration                  <time> 01:29:35, 00:00:01, 01:29:21, 00:00:02, 00:…
## $ Attributes                <chr> NA, "Autoplayed: user action: User_Interacti…
## $ Title                     <chr> "Wallander: Series 1: Firewall (Episode 2)",…
## $ `Supplemental Video Type` <chr> NA, NA, NA, NA, NA, "TEASER_TRAILER", NA, NA…
## $ `Device Type`             <chr> "Amazon Fire TV Stick 2020 Lite Streaming St…
## $ Bookmark                  <time> 01:27:50, 00:00:01, 01:29:02, 00:00:02, 00:…
## $ `Latest Bookmark`         <chr> "01:27:50", "Not latest view", "01:29:02", "…
## $ Country                   <chr> "DE (Germany)", "DE (Germany)", "DE (Germany…

The data types are better than in netflix_general, but there are still some issues we need to solve.

Clean and transform dataset

The cleaning here involves several steps. First, we remove supplemental videos - trailers, recaps, special features, etc. - by keeping only the rows where “Supplemental Video Type” is NA. Then we run into the issue that the column “Title” specifies TV show episodes by name, making it too granular and incompatible with the larger Netflix dataset, where series titles stand alone.

Since we will want to merge those two datasets, they need to fit together. We split that column into 3 columns - ‘title’ (lower-case, as in ‘netflix_general’), ‘season’ and ‘episode_title’. This information is separated by colons (:), but we can’t just split on those, because some films have colons in their names. So after splitting, we check whether the ‘season’ column contains words like “Season” (and equivalents in other languages) or the ‘episode_title’ column contains words like “Episode”; if neither matches, the pieces are pasted back together into ‘title’.

my_data <- my_data %>%
  #filter out supplemental videos
  filter(is.na(`Supplemental Video Type`)) %>% 
  #separate titles for TV show episodes
  separate(col=Title, into=c("title", "season", "episode_title"), 
           sep=': ', remove=TRUE) %>% 
  mutate(title=ifelse(grepl("Season", season) | 
                        grepl("Series", season) | 
                        grepl("Staffel", season) |
                        grepl("Volume", season) |                        
                        grepl("Temporada", season) |
                        grepl("Episode", episode_title) | 
                        grepl("Chapter", episode_title) |
                        grepl("Episódio", episode_title) | 
                        is.na(season),  title, paste(title, season, sep =": ")),
         # Parse "Start Time" as a date-time
         `Start Time`= ymd_hms(`Start Time`),
         # Split into columns for date, month, year, weekday and start time, which we'll need later
         viewing_date = date(`Start Time`),
         viewing_month = month(`Start Time`, label=TRUE),
         viewing_year = year(`Start Time`),
         viewing_weekday = wday(`Start Time`, label=TRUE),
         start_time = hms(format(as.POSIXct(`Start Time`), format = "%H:%M:%S")))%>%
  # Rename "Duration" column to watch_time to avoid confusion
  rename(watch_time = Duration)

The TechAcademy project guide says here that “Netflix recorded every time you clicked on a movie even if you didn’t watch it. Check which column indicates those with a specific value.” I imagine this refers to the column “Attributes”, which marks whether a title was autoplayed, but I don’t think this matters much, as an autoplayed film might still have been watched. I would agree that entries with a very short watch_time might be better off removed to avoid bias, but looking at the data, those often come from pauses, so the cumulative watch_time should include those short bursts. I’ll keep them in for now and filter them out when needed.

Before we join the datasets, let’s remove the columns we don’t want from my_data so the joined data isn’t unnecessarily large.

my_data <- select(my_data, -c("Start Time", "Attributes", "Supplemental Video Type", "Device Type", "Bookmark", "Latest Bookmark", "Country"))

The only column that still has a space in its name is “Profile Name”. Let’s rename it, since the space is kind of annoying and sometimes causes problems with selection.

my_data <- my_data %>% 
  rename(profile_name = "Profile Name")

Join datasets

Now let’s join! Both datasets have a column called “title” which is reasonably clean now, so let’s use that.

netflix_combined <- my_data %>% 
  left_join(netflix_general, by = "title")

Interactive line plot

Our goal for this task is to plot how each viewer’s activity developed over time. First, we need to group the watch times per day for each viewer (that is, if you have different viewers on your account, as I do):

by_date <- netflix_combined %>% 
  group_by(profile_name, viewing_date) %>% 
  mutate(watchtime_per_day = as.period(sum(watch_time)))

TechAcademy recommended a dynamic chart here, since a static plot would be a bit unclear; I figured it would be clearest as an interactive plot:

per_day_plot <- ggplot(by_date, aes(y = watchtime_per_day, x=viewing_date, color=profile_name))+
  geom_point()+
  geom_line()+
  scale_y_time(name = "Watch time per day")+
  scale_x_date(date_breaks = "9 month", date_labels = "%Y-%m")+ 
  theme(axis.text.x = element_text(angle= 45, hjust=1))
# Make it interactive  
ggplotly(per_day_plot)

My mother, Glaucia, seems to be the big binge watcher in the family! She has some extraordinary bursts of activity (17 hours and 34 minutes on May 30th, 2021!) that are almost suspiciously long. I’ll investigate that in a bit. The other users’ binge-watching activity is more normal, with peaks around 8 hours in a day. Glaucia, Caio and Isa (that’s me) all started using the Netflix account in 2017, but Samuel only started in April 2020 - perhaps a pandemic-related change in viewing habits?

netflix_combined %>% 
  filter(profile_name == "glaucia") %>% 
  filter(viewing_date == "2021-05-30")
## # A tibble: 44 × 22
##    profile_name watch_…¹ title season episo…² viewing_…³ viewi…⁴ viewi…⁵ viewi…⁶
##    <chr>        <time>   <chr> <chr>  <chr>   <date>     <ord>     <dbl> <ord>  
##  1 glaucia      22'36"   Sam … Tempo… #Assis… 2021-05-30 May        2021 Sun    
##  2 glaucia      22'51"   iCar… Tempo… Sonhos… 2021-05-30 May        2021 Sun    
##  3 glaucia      23'38"   iCar… Tempo… Não qu… 2021-05-30 May        2021 Sun    
##  4 glaucia      23'37"   iCar… Tempo… iHatch… 2021-05-30 May        2021 Sun    
##  5 glaucia      23'37"   iCar… Tempo… Sou su… 2021-05-30 May        2021 Sun    
##  6 glaucia      23'37"   iCar… Tempo… Coraçã… 2021-05-30 May        2021 Sun    
##  7 glaucia      23'41"   iCar… Tempo… A namo… 2021-05-30 May        2021 Sun    
##  8 glaucia      23'37"   iCar… Tempo… iWant … 2021-05-30 May        2021 Sun    
##  9 glaucia      23'41"   iCar… Tempo… Espion… 2021-05-30 May        2021 Sun    
## 10 glaucia      23'38"   iCar… Tempo… Quero … 2021-05-30 May        2021 Sun    
## # … with 34 more rows, 13 more variables: start_time <Period>, show_id <chr>,
## #   type <chr>, director <chr>, cast <chr>, country <chr>, date_added <date>,
## #   release_year <dbl>, rating <chr>, listed_in <chr>, description <chr>,
## #   duration_movie <dbl>, duration_season_number <dbl>, and abbreviated
## #   variable names ¹​watch_time, ²​episode_title, ³​viewing_date, ⁴​viewing_month,
## #   ⁵​viewing_year, ⁶​viewing_weekday

Ok, there’s definitely something weird going on - I doubt my mom is going on hours-long binges of iCarly! I asked her about this, and she can’t think of any explanation other than that her account info is being used by someone she definitely did not approve of. This suspicious activity continued for months, though it seems to have died down in mid-2022. We should change our password just the same.

Let’s get personal

Next, we’ll investigate my own viewing habits. What’s the longest movie I have ever watched on Netflix?

isa_longest <- netflix_combined %>%
  filter(profile_name == "Isa") %>%
  slice_max(duration_movie)
print(isa_longest)
## # A tibble: 3 × 22
##   profile_name watch_t…¹ title season episo…² viewing_…³ viewi…⁴ viewi…⁵ viewi…⁶
##   <chr>        <time>    <chr> <chr>  <chr>   <date>     <ord>     <dbl> <ord>  
## 1 Isa          14'03"    The … <NA>   <NA>    2019-12-17 Dec        2019 Tue    
## 2 Isa          49'56"    The … <NA>   <NA>    2019-12-01 Dec        2019 Sun    
## 3 Isa          23'38"    The … <NA>   <NA>    2019-11-30 Nov        2019 Sat    
## # … with 13 more variables: start_time <Period>, show_id <chr>, type <chr>,
## #   director <chr>, cast <chr>, country <chr>, date_added <date>,
## #   release_year <dbl>, rating <chr>, listed_in <chr>, description <chr>,
## #   duration_movie <dbl>, duration_season_number <dbl>, and abbreviated
## #   variable names ¹​watch_time, ²​episode_title, ³​viewing_date, ⁴​viewing_month,
## #   ⁵​viewing_year, ⁶​viewing_weekday

I’m embarrassed to say I tried getting through The Irishman on three different days, and in the end I never finished the movie; it was too long…

Monthly viewing time in 2021

Now, we are supposed to analyze how my viewing time has changed throughout one year, 2021. First, let’s filter for my profile and the year 2021, then group by month and add the watch time up:

isa_2021_month_watchtime <- netflix_combined %>%
  filter(profile_name == "Isa" & viewing_year == 2021) %>%
  group_by(viewing_month) %>% 
  mutate(watchtime_per_month = sum(watch_time))

Now let’s select just the columns we need - the month and watch time per month - and remove duplicates:

isa_2021_month_watchtime <- select(isa_2021_month_watchtime, c("viewing_month", "watchtime_per_month"))
isa_2021_month_watchtime <- unique(isa_2021_month_watchtime)

Now let’s make a graph:

ggplot(isa_2021_month_watchtime, aes(x=viewing_month, y=watchtime_per_month)) + 
  geom_col() +
  labs(x = "Month", title = "Isa's Monthly Watch Times 2021") +
  scale_y_time(name = 'Watch time (hh:mm:ss)')

There is more variation than I was expecting. The dip in June and August could be explained as the summer months - better weather, more time outside - but the surge in July goes against that logic. I started a new job in April and was teaching from April to June, so that might contribute to the diminished watch times then.

Average per weekday

Now we want to analyze the viewing time of specific weekdays. On which days have I watched more Netflix, is there a peak?

We’ll use a similar approach to last time, but now we’ll group by weekday and take the mean instead of the sum:

isa_2021_weekday_watchtime <- netflix_combined %>%
  filter(profile_name =="Isa" & viewing_year == 2021) %>%
  group_by(viewing_weekday) %>% 
  mutate(watchtime_per_weekday = mean(watch_time))

Now let’s select just the columns we need - the weekday and watch time per weekday - and remove duplicates:

isa_2021_weekday_watchtime <- select(isa_2021_weekday_watchtime, c('viewing_weekday', 'watchtime_per_weekday'))
isa_2021_weekday_watchtime <- unique(isa_2021_weekday_watchtime)

Now let’s make a graph:

ggplot(isa_2021_weekday_watchtime, aes(x=viewing_weekday, y=watchtime_per_weekday)) + 
  geom_col() +
  scale_y_time(name='Watch time (hh:mm:ss)') +
  labs(x = "Weekday", title="Isa's Average Watch Times per Weekday 2021")

Saturday is second-to-last (after Monday) in my watch times, defying my preconceived notion that I’d watch more on the weekend - though Sunday is highest, as expected.

Binge watching

In this section, the goal is to create a plot of my top 10 binge TV shows. We first want to filter for TV shows and group by date and title. We then sum the watch time of each title per day.

binge_TV <- netflix_combined %>%
  filter(profile_name == "Isa" & type == "TV Show") %>%
  group_by(title, viewing_date) %>% 
  mutate(watchtime_per_session = sum(watch_time))

Now let’s select just the columns we need - title, day and watch time per session - and remove duplicates:

binge_TV <- select(binge_TV, c("title", "viewing_date", "watchtime_per_session"))
binge_TV <- unique(binge_TV)

Let’s take a look at the data

binge_TV[order(-binge_TV$watchtime_per_session),]
## # A tibble: 801 × 3
## # Groups:   title, viewing_date [801]
##    title                   viewing_date watchtime_per_session
##    <chr>                   <date>       <drtn>               
##  1 Stranger Things         2022-05-28   23472 secs           
##  2 Marvel's The Defenders  2017-08-27   23174 secs           
##  3 Orange Is the New Black 2019-08-08   19676 secs           
##  4 Squid Game              2021-10-15   18737 secs           
##  5 Suits                   2017-10-05   18289 secs           
##  6 Stranger Things         2017-11-01   17427 secs           
##  7 Better Call Saul        2022-05-29   17106 secs           
##  8 Stranger Things         2022-06-18   16934 secs           
##  9 Maniac                  2019-01-08   16613 secs           
## 10 Sex Education           2021-09-20   15863 secs           
## # … with 791 more rows

Sometimes the same TV show appears multiple times, since we are counting the top binge sessions here, not TV shows. There are several ways to define and analyze the “top 10 binge TV shows”. I’m going to filter out the watch sessions that were very short - following Netflix’s own practice, anything under 2 minutes. I will then group the TV shows by title and calculate the median watch time per title. To identify the shows that were most “binge-worthy” for me, the median watch time per session makes the most sense, as it reflects a typical session and isn’t swayed by outliers.

top_binge_TV <- binge_TV %>% 
  # Cut-off for sessions not watched intentionally, as defined by Netflix (2 minutes)
  filter(watchtime_per_session > 120) %>%
  group_by(title) %>% 
  summarize(mean = mean(watchtime_per_session),
            sd = sd(watchtime_per_session),
            sum = sum(watchtime_per_session),
            median = median(watchtime_per_session))

Let’s see the top 10:

top10_binge_TV <- top_binge_TV %>%
  slice_max(median, n = 10)
print(top10_binge_TV)
## # A tibble: 10 × 5
##    title                          mean              sd sum         median    
##    <chr>                          <drtn>         <dbl> <drtn>      <drtn>    
##  1 Marvel's The Defenders         23174.000 secs   NA   23174 secs 23174 secs
##  2 Next in Fashion                 9662.667 secs 4232.  28988 secs  8660 secs
##  3 Stranger Things                 9199.200 secs 5797. 183984 secs  8311 secs
##  4 Bridgerton                      8361.143 secs 3644.  58528 secs  8303 secs
##  5 Halt and Catch Fire             9605.000 secs 5210.  28815 secs  7663 secs
##  6 Katla                           7831.000 secs 5507.  23493 secs  7630 secs
##  7 Spinning Out                    6840.500 secs 4042.  27362 secs  7408 secs
##  8 Chilling Adventures of Sabrina  7235.000 secs   NA    7235 secs  7235 secs
##  9 Invisible City                  7103.000 secs 3851.  14206 secs  7103 secs
## 10 The Hook Up Plan                7011.000 secs   NA    7011 secs  7011 secs

And here as a graph:

ggplot(top10_binge_TV, aes(x=reorder(title, median), y=median))+
  geom_col()+
  coord_flip()+
  scale_y_time(name='Median watch time (hh:mm:ss)') +
  labs(x="TV Show", title="Isa's Median Watch Times per TV Show Binge session")

I share my Netflix account with my boyfriend, and looking at this table makes me realize he is more of a binge watcher than me. The number one position by a wide margin - Marvel’s The Defenders - is something he watched by himself, while the second and third places, Next in Fashion and Stranger Things, are things we watched together.

Scatterplot with marginal density

How has the watching behavior of my family developed since we first started using Netflix? We will visualize this via a scatterplot with marginal densities, including all profile names so we can compare them visually.

Let’s filter out the things watched for less than two minutes:

full_records <- netflix_combined %>% 
  filter(watch_time > 120)

And now let’s make a scatterplot:

scatter_marginal <- ggplot(data = full_records, 
       aes(x = viewing_date, y = watch_time)) +
  geom_point(aes(col = profile_name)) + 
  theme(legend.position = "bottom")

This is the code to add the marginal density:

print(ggMarginal(scatter_marginal, type = "density",
                   groupFill = TRUE,
                   groupColour = TRUE))

This plot is a bit too convoluted for my tastes and it’s kind of hard to interpret so many datapoints at once. Since Netflix records every time you click on something, sessions with fits and starts are recorded multiple times, creating a wall of points for some days. Most watch sessions are quite short - everything below an hour becomes a mass of points, and as we can see from the density plot on the right, they cluster in particular around the half-hour mark. From the density plot above, we confirm some of the information we already had, like the fact that Samuel only started watching using this account in 2020, but has been watching steadily since then.

Word cloud with my favorite genres

Let’s create a genre dataframe for my (Isa) genres:

isa_genres <- netflix_combined %>% 
  filter(profile_name == "Isa") %>% 
  select(listed_in) %>%
  separate_rows(listed_in, sep = ", ") %>%
  group_by(listed_in) %>% 
  summarize(freq=n())

And here’s the wordcloud:

#set seed so that wordcloud remains the same
set.seed(401)
#Wordcloud
wordcloud(words=isa_genres$listed_in, freq=isa_genres$freq, min.freq = 5, rot.per = 0.3,
                     max.words = 200, random.order = FALSE, colors = brewer.pal(6, "Dark2"))

Content-Based Recommendation System

Cleaning the dataset

Let’s first select the categories we think are relevant for a recommendation. I went a bit beyond the requirements of the guide (which were cast, listed_in (aka genre) and description), because I think things like director, country and rating are also very valuable for a recommendation.

rec_data <- select(netflix_general, c("title", "director", "cast", "country", "listed_in", "rating", "description"))

Cleanup of terms with spaces

Now we’ll need to take the data in all columns which have spaces - apart from title (which is already one unit) and description (which has flowing text) - and exchange the spaces inside a term for underscores. This will keep terms that are a unit together and allow us to separate the others by space later on.

term_clean <- function(df) {
  # protect the ", " separator between items before touching other spaces
  dirty_column <- gsub(pattern = ", ", replacement = "/", x = df)
  # glue multi-word terms together with underscores
  dirty_column <- gsub(pattern = " ", replacement = "_", x = dirty_column)
  # restore a plain space between items
  clean_column <- gsub(pattern = "/", replacement = " ", x = dirty_column)
  return(clean_column)
}
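
To see what this does, here’s a quick standalone check of the same three gsub steps on a made-up two-person cast string (the names are just an illustration):

```r
# the same three gsub steps as term_clean, on a made-up cast cell
x <- "Steven Spielberg, Kirsten Johnson"
x <- gsub(", ", "/", x)  # protect the separator between people
x <- gsub(" ", "_", x)   # glue each multi-word name into one token
x <- gsub("/", " ", x)   # restore a plain space between people
x
## [1] "Steven_Spielberg Kirsten_Johnson"
```

Each person is now a single space-separated token, which is exactly what the tokenizer needs later on.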

Apply the function to columns 2 to 5:

rec_data <- rec_data %>% 
   mutate(across(2:5, term_clean))

Ratings cleanup

For the rating column, I ran into some trouble in the next steps because it contains a lot of terms that are not recognized as words during tokenization. Look at what happens when I call the termFreq function on the rating column:

termFreq(rec_data$rating)
##    nc-17    pg-13    tv-14     tv-g    tv-ma    tv-pg     tv-y    tv-y7 
##        3      490     2160      220     3207      863      307      334 
## tv-y7-fv 
##        6 
## attr(,"class")
## [1] "term_frequency" "integer"

Some of the most common ratings, like G, PG and R aren’t showing up. It seems only the terms with a hyphen appear. Compare the above list to the summary of the column as factor:

summary(as.factor(rec_data$rating))
##        G    NC-17       NR       PG    PG-13        R    TV-14     TV-G 
##       41        3       80      287      490      799     2160      220 
##    TV-MA    TV-PG     TV-Y    TV-Y7 TV-Y7-FV       UR     NA's 
##     3207      863      307      334        6        3        7

There is probably a better way to solve this (do let me know!), but I have opted to replace the ignored terms with the full words their acronyms represent.

Since UR means “unrated”, which is synonymous with NR (“not rated”), I’ll change both to “Not_Rated”. We could argue that both should become NA instead, but carrying the not rated/unrated label often means a film is more risqué, so let’s keep the value here.

# Create a named vector of substitutions
rating_subs <- c("G" = "General", "NR" = "Not_Rated", "PG" = "Parental_Guidance", "R" = "Restricted", "UR" = "Not_Rated")
# Apply the substitutions, looking each matching rating up by name
matches <- rec_data$rating %in% names(rating_subs)
rec_data$rating[matches] <- rating_subs[rec_data$rating[matches]]

Creating a corpus

Since we’ll need to repeat the same steps (or close to them) several times, it’s worth writing a function for the corpus. It takes one column of rec_data at a time and first uses the VectorSource and VCorpus functions to turn it into a corpus. Then DocumentTermMatrix splits the corpus into separate words, each of which becomes its own column.

Corpus function

corpus_matrix <- function(term_column, control_list) {
  corpus_column <- VCorpus(VectorSource(term_column))
  column_words <- as.matrix(DocumentTermMatrix(corpus_column, control = control_list))
  rownames(column_words) <-  rec_data$title
  return(column_words)
}

Description control

For text like that found in the description column, we need to do quite a bit of cleanup. We can collect all the needed settings in a list so they’re easy to pass to the DocumentTermMatrix function.

Tokenizing means that each word turns into a token and is given its own column. Setting the language is important because the other functions are language-dependent. Lowering the case avoids double counting of the same words, and removing punctuation lets us focus on the words themselves. We can remove English stopwords (very common words that appear in most descriptions, like “the”, “and”, etc.) and identify words by their stems (e.g. “wait”, “waits” and “waited” all resolve to “wait”). The weighting setting applies tf-idf, which gives rare, distinctive words more weight than ubiquitous ones. Finally, bounds establishes the minimum and maximum number of times a word must appear in the corpus to be taken into account; this keeps the size of our data manageable.

control_description <- list(
    tokenize = words,
    language="en",
    tolower = TRUE,
    removePunctuation = TRUE,
    stopwords = TRUE,
    stemming = TRUE,
    weighting = weightTfIdf,
    bounds = list(global = c(50, Inf))
)

Control for other columns

The controls for the other columns can be much sparser. Genre, country and rating are in practice a controlled vocabulary: there is a limited number of possible values, and they are always written the same way. Cast and director have far more possible values, but (given that this is a clean dataset) they too are always written the same way. For that reason, we don’t need to lower the case or remove punctuation, and since these aren’t ordinary words, specifying a language, stemming and removing stopwords are unnecessary.

Since we substituted the spaces inside terms that should be kept together with underscores, we can still tokenize based on “words”.

The “bounds” need to be very different from the ones used in the description, because terms with low frequencies are still very relevant here. I’ve decided to count anything that appears at least twice towards the recommendation (appearing only once is by definition useless for a recommendation). This threshold might seem too low, but my tests showed that anything higher would exclude a lot of useful information. For instance, very famous actors such as Tom Cruise and Barbra Streisand only have 2 movies in the Netflix dataset. Even younger stars like Emma Stone (9 films) and Ryan Gosling (4 films) have a small number of films.

control_others <- list(
  tokenize = words, 
  weighting = weightTfIdf,
  bounds = list(global = c(2, Inf))
)

Creation of corpora

Now we just apply the corpus_matrix function we created to each of the columns using the appropriate controls and name the results accordingly.

corpus_description <- corpus_matrix(rec_data$description, control_description)
corpus_genre <- corpus_matrix(rec_data$listed_in, control_others)
corpus_cast <- corpus_matrix(rec_data$cast, control_others)
corpus_director <- corpus_matrix(rec_data$director, control_others)
corpus_country <- corpus_matrix(rec_data$country, control_others)
corpus_rating <- corpus_matrix(rec_data$rating, control_others)

Similarity function

Now we should call the cosine function from the lsa package on our matrices to calculate the similarity (from 0 to 1) between all titles in each of the categories. Because I left the bounds of cast and director at a minimum of 2 mentions, however, this would be extremely time-consuming - by my estimates, it would take over four days.

To avoid this, I’ve created the following function which allows us to do the similarity computations title by title. Since we want to generate recommendations based on individual films, we don’t need to have the results of all cosine similarities in advance; we can just call the following function so that the film for which we want a recommendation is compared to all other films.

similarity <- function(title, corpus){
  # Creates empty matrix with number of rows for all titles and 1 column, for the selected title
  sim_matrix <- matrix(nrow = nrow(corpus), ncol = 1)
  # sets the titles for all as row names
  rownames(sim_matrix) <- rec_data$title
  # sets the column name as the selected title we'll generate the recommendations for
  colnames(sim_matrix) <-  title
  # loop over the corpus, comparing each title's row to the selected title, and add the results to the empty matrix
  for (i in 1:nrow(corpus)) {
    sim_matrix[i, title] <- cosine(corpus[title, ], corpus[i, ])
  }
  return(sim_matrix)
}

Finally, we can create a recommendation function which calls the similarity function for a specific title in each of the corpora we created. To make things more flexible, I’ve added weights for each of the categories and the number of recommendations to the function call. They have default values which match what matters most to me personally - for instance, director is more important than any other category. The default number of recommendations is the one suggested by Tech Academy (10).

Recommendation function

# The weights can be used so that the user can decide what things matter most to them regarding recommendations
# n dictates the number of recommendations, with default 10
recommendation <- function(title, n = 10, description_wt = 1, genre_wt = 2, cast_wt = 2, director_wt = 3, country_wt = 1, rating_wt = 1){
# Create a dataframe where each column applies the similarity function to a given corpus, multiplying each by its designated weight
similarity_df <- data.frame(
  description_wt * similarity(title, corpus_description), 
  genre_wt * similarity(title, corpus_genre), 
  cast_wt * similarity(title, corpus_cast),
  director_wt * similarity(title, corpus_director),
  country_wt * similarity(title, corpus_country),
  rating_wt * similarity(title, corpus_rating))
# Assign column names matching each category
colnames(similarity_df) <- c("sim_description", "sim_genre", "sim_cast", "sim_director", "sim_country", "sim_rating")
# Replace NAs so the mean won't be NA
similarity_df <- replace(similarity_df,is.na(similarity_df), 0)
# Create a new column with the summed weighted similarity across categories
similarity_df$rec_value <- rowSums(similarity_df)
# Divide by the sum of the weights so rec_value is a weighted mean, regardless of the weights chosen
similarity_df$rec_value <- similarity_df$rec_value / sum(description_wt, genre_wt, cast_wt, director_wt, country_wt, rating_wt)
# To get the top n recommendations, arrange the rec_value column in descending order
# and take the films in positions 2 to n + 1 (default 2:11), because number 1 is always the film itself
similarity_df <- arrange(similarity_df, desc(rec_value)) %>% slice(2 : (n + 1))
return(similarity_df)
}

Recently Watched

To define which movies and TV shows I watched last - the results of which I’ll feed into the recommendation function - let’s go back to my netflix_combined dataset and look at the last 5 things I watched. By filtering on !is.na(type), I’m making sure the item is also present in netflix_general, because the type column would otherwise be empty. We also need to remove duplicate titles, because every pause is recorded again; this matters even more for TV shows, which are often watched several episodes at a time.

recent_watches <- netflix_combined %>% 
  filter(profile_name == "Isa", !is.na(type), !duplicated(title)) %>% 
  arrange(desc(viewing_date)) %>% 
  slice_max(viewing_date, n = 5)
print(recent_watches <- select(recent_watches, title))
## # A tibble: 5 × 1
##   title           
##   <chr>           
## 1 Love & Anarchy  
## 2 Stranger Things 
## 3 Jaws            
## 4 Russian Doll    
## 5 Nobody's Looking

Get recommendations for recent watches

I could probably also do this in a for loop to make this more elegant, but for now, here are the recommendations for the last five things I watched:
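
For reference, the loop version could be sketched roughly like this (a hypothetical helper - recs_for_all is a name I made up, and it assumes the recommendation function and recent_watches defined above):

```r
# hedged sketch: collect recommendations for every recent watch in one named list
recs_for_all <- function(titles, ...) {
  recs <- lapply(titles, function(t) recommendation(toString(t), ...))
  names(recs) <- titles
  recs
}
# recs_for_all(recent_watches$title) would return one recommendation data frame per title
```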

First Recommendation - Love & Anarchy

My most recent watch is a Swedish TV show called “Love & Anarchy”. I could call the recommendation function directly on its title, but to make my code more usable for others, I’ll use toString(recent_watches[1, 1]) instead.

Since I wrote a function which allows us to weigh each parameter differently, let’s take advantage of this here. Let’s say I’m learning Swedish and therefore would want country to be weighed more heavily in my recommendations, but I don’t particularly care about the director:

# Call recommendation function for most recently watched item with country_wt=3
print(first_rec <- recommendation(title = toString(recent_watches[1, 1]), country_wt = 3))
##                                     sim_description sim_genre  sim_cast
## The Most Beautiful Hands of Delhi        0.00000000 1.1594342 0.7696367
## Young Royals                             0.20108371 1.3403484 0.0000000
## Gentlemen and Gangsters                  0.00000000 1.2590645 0.0000000
## Red Dot                                  0.05586068 0.0000000 1.1864482
## Caliphate                                0.00000000 0.4982726 0.5932241
## Jag älskar dig: En skilsmässokomedi      0.00000000 0.0000000 1.0884306
## Fallet                                   0.00000000 1.0791827 0.0000000
## Bonus Family                             0.18208244 0.4982726 0.0000000
## Quicksand                                0.00000000 0.3610814 0.0000000
## David Batra: Elefanten i rummet          0.16867863 0.0000000 0.0000000
##                                     sim_director sim_country sim_rating
## The Most Beautiful Hands of Delhi              0           3          1
## Young Royals                                   0           3          1
## Gentlemen and Gangsters                        0           3          1
## Red Dot                                        0           3          1
## Caliphate                                      0           3          1
## Jag älskar dig: En skilsmässokomedi            0           3          1
## Fallet                                         0           3          1
## Bonus Family                                   0           3          1
## Quicksand                                      0           3          1
## David Batra: Elefanten i rummet                0           3          1
##                                     rec_value
## The Most Beautiful Hands of Delhi   0.4940892
## Young Royals                        0.4617860
## Gentlemen and Gangsters             0.4382554
## Red Dot                             0.4368591
## Caliphate                           0.4242914
## Jag älskar dig: En skilsmässokomedi 0.4240359
## Fallet                              0.4232652
## Bonus Family                        0.3900296
## Quicksand                           0.3634234
## David Batra: Elefanten i rummet     0.3473899

I kind of hate the way data frames get printed in R Markdown. Is there a way to make them nicer-looking and more organized?
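
One option (an assumption on my part, since we haven’t used it in this project) is knitr::kable, which renders a data frame as a proper table in the knitted document. A small standalone demo with a stand-in data frame:

```r
library(knitr)
# tiny stand-in data frame; in the project this would be e.g. kable(first_rec, digits = 3)
demo <- data.frame(rec_value = c(0.494, 0.462),
                   row.names = c("The Most Beautiful Hands of Delhi", "Young Royals"))
kable(demo, digits = 3)
```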

To make the results a bit cleaner, we can also print just the titles in a message:

message("If you enjoyed '", recent_watches[1, 1], "' you might also like '", rownames(first_rec)[1], "', '", rownames(first_rec)[2], "', '", rownames(first_rec)[3], "', '", rownames(first_rec)[4], "', '", rownames(first_rec)[5], "', '", rownames(first_rec)[6], "', '", rownames(first_rec)[7], "', '", rownames(first_rec)[8], "', '", rownames(first_rec)[9], "' and '", rownames(first_rec)[10], "'.")
## If you enjoyed 'Love & Anarchy' you might also like 'The Most Beautiful Hands of Delhi', 'Young Royals', 'Gentlemen and Gangsters', 'Red Dot', 'Caliphate', 'Jag älskar dig: En skilsmässokomedi', 'Fallet', 'Bonus Family', 'Quicksand' and 'David Batra: Elefanten i rummet'.

The results make a lot of sense. They’re all Swedish TV Shows, and the ones that have similar genres or overlapping cast with “Love & Anarchy” are higher up on the recommendation list.

Second Recommendation - Stranger Things

Let’s call this one with just the default weights:

print(second_rec <- recommendation(toString(recent_watches[2, 1])))
##                                sim_description sim_genre sim_cast sim_director
## Beyond Stranger Things              0.06084501 1.2688609 1.678333            0
## Chilling Adventures of Sabrina      0.00000000 2.0000000 0.000000            0
## Manifest                            0.18530829 1.5064163 0.000000            0
## The Messengers                      0.06591980 1.5064163 0.000000            0
## The Originals                       0.00000000 1.5278668 0.000000            0
## The Vampire Diaries                 0.00000000 1.5064163 0.000000            0
## Nightflyers                         0.08213150 2.0000000 0.000000            0
## Mystery Science Theater 3000        0.07529745 0.9996219 0.000000            0
## Charmed                             0.00000000 1.0247657 0.000000            0
## Star-Crossed                        0.12360338 0.8775475 0.000000            0
##                                sim_country sim_rating rec_value
## Beyond Stranger Things                   1          1 0.5008039
## Chilling Adventures of Sabrina           1          1 0.4000000
## Manifest                                 1          1 0.3691725
## The Messengers                           1          1 0.3572336
## The Originals                            1          1 0.3527867
## The Vampire Diaries                      1          1 0.3506416
## Nightflyers                              1          0 0.3082131
## Mystery Science Theater 3000             1          1 0.3074919
## Charmed                                  1          1 0.3024766
## Star-Crossed                             1          1 0.3001151
## If you enjoyed 'Stranger Things' you might also like 'Beyond Stranger Things', 'Chilling Adventures of Sabrina', 'Manifest', 'The Messengers', 'The Originals', 'The Vampire Diaries', 'Nightflyers', 'Mystery Science Theater 3000', 'Charmed' and 'Star-Crossed'.

These also make a lot of sense. Specials like “Beyond Stranger Things” are obvious recommendations for fans of the show, while almost all the subsequent recommendations involve young casts embroiled in supernatural events.

Third Recommendation - Jaws

There are a lot of movies in the dataset by Steven Spielberg. Let’s say I’m OK with other films of his being recommended, but I really like Jaws and want films similar to it - not just Spielberg films that might be very different from Jaws. I don’t want to be limited to American movies, and I want similarity in plot and genre to weigh more heavily.

print(third_rec <- recommendation(toString(recent_watches[3, 1]), director_wt = 1, country_wt = 0, description_wt = 3, genre_wt = 3))
##                                      sim_description sim_genre sim_cast
## Schindler's List                           0.0000000  2.666841        0
## Indiana Jones and the Temple of Doom       0.0000000  2.560812        0
## Rain Man                                   0.5667057  2.666841        0
## Daughters of the Dust                      0.7413321  2.343128        0
## Bonnie and Clyde                           0.0000000  3.000000        0
## Rocky                                      0.0000000  3.000000        0
## Superfly                                   0.0000000  3.000000        0
## The Outlaw Josey Wales                     0.0000000  2.901906        0
## Monty Python and the Holy Grail            0.0000000  2.749315        0
## The Last Days of Chez Nous                 0.0000000  2.666841        0
##                                      sim_director sim_country sim_rating
## Schindler's List                                1           0          1
## Indiana Jones and the Temple of Doom            1           0          1
## Rain Man                                        0           0          1
## Daughters of the Dust                           0           0          1
## Bonnie and Clyde                                0           0          1
## Rocky                                           0           0          1
## Superfly                                        0           0          1
## The Outlaw Josey Wales                          0           0          1
## Monty Python and the Holy Grail                 0           0          1
## The Last Days of Chez Nous                      0           0          1
##                                      rec_value
## Schindler's List                     0.4666841
## Indiana Jones and the Temple of Doom 0.4560812
## Rain Man                             0.4233546
## Daughters of the Dust                0.4084460
## Bonnie and Clyde                     0.4000000
## Rocky                                0.4000000
## Superfly                             0.4000000
## The Outlaw Josey Wales               0.3901906
## Monty Python and the Holy Grail      0.3749315
## The Last Days of Chez Nous           0.3666841
## If you enjoyed 'Jaws' you might also like 'Schindler's List', 'Indiana Jones and the Temple of Doom', 'Rain Man', 'Daughters of the Dust', 'Bonnie and Clyde', 'Rocky', 'Superfly', 'The Outlaw Josey Wales', 'Monty Python and the Holy Grail' and 'The Last Days of Chez Nous'.

Director is still being counted here, but only Spielberg movies that also have a large genre overlap with Jaws showed up. The fact that Schindler’s List has such a high genre overlap with Jaws does show the limitations of the genre descriptions, however. Still, most of the films here make a lot of sense as recommendations.

Fourth Recommendation - Russian Doll

Let’s say my favorite thing about Russian Doll is the cast. I want recs that also have the actors in them, and the rest doesn’t matter much to me:

print(fourth_rec <- recommendation(toString(recent_watches[4, 1]), description_wt = 0, genre_wt = 0, cast_wt = 1, director_wt = 0, country_wt = 0, rating_wt = 0))
##                                           sim_description sim_genre  sim_cast
## Valor                                                   0         0 0.4259042
## MFKZ                                                    0         0 0.1979885
## Orange Is the New Black                                 0         0 0.1769547
## Gimme Shelter                                           0         0 0.1597748
## Tales of the City                                       0         0 0.1397574
## Sleeping with Other People                              0         0 0.1383546
## Bad Boys II                                             0         0 0.1306690
## Zodiac                                                  0         0 0.1298259
## The Super                                               0         0 0.1271272
## True Memoirs of an International Assassin               0         0 0.1250339
##                                           sim_director sim_country sim_rating
## Valor                                                0           0          0
## MFKZ                                                 0           0          0
## Orange Is the New Black                              0           0          0
## Gimme Shelter                                        0           0          0
## Tales of the City                                    0           0          0
## Sleeping with Other People                           0           0          0
## Bad Boys II                                          0           0          0
## Zodiac                                               0           0          0
## The Super                                            0           0          0
## True Memoirs of an International Assassin            0           0          0
##                                           rec_value
## Valor                                     0.4259042
## MFKZ                                      0.1979885
## Orange Is the New Black                   0.1769547
## Gimme Shelter                             0.1597748
## Tales of the City                         0.1397574
## Sleeping with Other People                0.1383546
## Bad Boys II                               0.1306690
## Zodiac                                    0.1298259
## The Super                                 0.1271272
## True Memoirs of an International Assassin 0.1250339
## If you enjoyed 'Russian Doll' you might also like 'Valor', 'MFKZ', 'Orange Is the New Black', 'Gimme Shelter', 'Tales of the City', 'Sleeping with Other People', 'Bad Boys II', 'Zodiac', 'The Super' and 'True Memoirs of an International Assassin'.

All of the results have at least one cast member in common with Russian Doll. One limitation of this recommendation system is that the cosine similarity of a category depends on the number of people named in the column. Two films might each have a single cast member in common with Russian Doll, but their similarity values will differ if the lengths of their listed casts differ. This is not necessarily a bad thing: the smaller the named cast, the more likely it is that each actor has a big role, and if I want to see a film or TV show because of a specific person, I’d want their role to be prominent.
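
A tiny worked example shows the effect. Here I use made-up binary “cast membership” vectors and compute the cosine by hand rather than via the corpora above:

```r
# cosine similarity of binary cast-indicator vectors, computed by hand
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
russian_doll <- c(1, 1, 0, 0, 0)  # two listed cast members (hypothetical)
film_small   <- c(1, 0, 1, 0, 0)  # two cast members, one shared
film_large   <- c(1, 0, 1, 1, 1)  # four cast members, one shared
cos_sim(russian_doll, film_small)  # 0.5
cos_sim(russian_doll, film_large)  # ~0.354 - same overlap, lower score
```

The overlap is identical (one actor), but the film with the longer cast list gets a lower score, exactly the behaviour described above.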

Fifth Recommendation - Nobody’s Looking

Now let’s say I’m getting overwhelmed with all these recommendations, and I want to limit their number to 5.

print(fifth_rec <- recommendation(toString(recent_watches[5, 1]), n = 5))
##                sim_description sim_genre sim_cast sim_director sim_country
## Super Drags         0.10991124  1.607135        0            0           1
## Samantha!           0.00000000  1.607135        0            0           1
## Borges              0.00000000  1.607135        0            0           1
## Invisible City      0.12034896  1.086888        0            0           1
## Sisters             0.09540587  2.000000        0            0           0
##                sim_rating rec_value
## Super Drags             1 0.3717046
## Samantha!               1 0.3607135
## Borges                  1 0.3607135
## Invisible City          1 0.3207237
## Sisters                 1 0.3095406
## If you enjoyed 'Nobody's Looking' you might also like 'Super Drags', 'Samantha!', 'Borges', 'Invisible City' and 'Sisters'.

These are all Brazilian shows with similar genres, which makes sense!

And that’s a wrap!